Generalized Linear Models (GLMs)#

Generalized Linear Models extend linear regression in two key ways:

  1. Mean modeling via a link function

We don’t predict the mean \(\mu\) directly from a linear model. Instead we predict a linear predictor:

\[ \eta = X\beta \]

and connect it to the mean via an inverse link \(g^{-1}\):

\[ \mu = \mathbb{E}[y\mid X] = g^{-1}(\eta) \]
  2. Squared loss is replaced by a distribution-aware loss

Instead of assuming Gaussian noise and minimizing squared error, we choose a distribution from the exponential family (more precisely: an exponential dispersion model) and minimize its unit deviance (plus regularization).


Learning goals#

  • understand what GLMs are (and why they exist)

  • connect: distribution ↔ link ↔ loss (deviance)

  • see how common models are just GLMs:

    • linear regression (Normal + identity)

    • logistic regression (Bernoulli + logit)

    • Poisson regression (Poisson + log)

    • Gamma / Inverse Gaussian regression (positive skew + log)

  • implement IRLS (iteratively reweighted least squares) for logistic and Poisson regression

  • map the ideas to scikit-learn’s GLM estimators

Table of contents#

  1. Why GLMs? (what linear regression can’t do)

  2. The GLM recipe (exponential family + link)

  3. Deviance: “distribution-aware squared error” (plus PDFs/PMFs of common GLM distributions)

  4. Optimization intuition: IRLS (Newton as weighted least squares), with logistic regression as the worked example

  5. Poisson regression (counts) as a GLM

  6. Gamma / Inverse Gaussian (positive, skewed targets, via Tweedie)

  7. Multiclass (categorical) GLM intuition

  8. Summary

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from scipy.special import expit
from scipy.stats import norm, poisson, gamma as gamma_dist, invgauss

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, PoissonRegressor, GammaRegressor, TweedieRegressor

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)

1) Why GLMs?#

Linear regression predicts a real number:

\[\hat{y} = x^\top \beta\]

That’s perfect when \(y\) is continuous and (roughly) Gaussian.

But what if:

  • \(y\) is a probability (must be between 0 and 1)

  • \(y\) is a count (must be \(0,1,2,\dots\))

  • \(y\) is positive and skewed (e.g., time-to-failure, insurance claim size)

GLMs solve this by keeping the linear predictor (the “score”) but changing:

  • the mapping from score → mean (the link)

  • the loss (via the likelihood / deviance)

Analogy:

A linear model is like a thermostat knob that turns freely. A GLM adds the right display and physics so that the knob controls the correct kind of quantity (a probability, a rate, a positive mean, …).

# Same linear score η, different inverse links μ = g⁻¹(η)
eta = np.linspace(-8, 8, 500)

mu_identity = eta
mu_sigmoid = expit(eta)          # Bernoulli mean via logit link
mu_exp = np.exp(eta)             # Poisson mean via log link

fig = go.Figure()
fig.add_trace(go.Scatter(x=eta, y=mu_identity, name="identity (Normal)", mode="lines"))
fig.add_trace(go.Scatter(x=eta, y=mu_sigmoid, name="sigmoid (Bernoulli/logit)", mode="lines"))
fig.add_trace(go.Scatter(x=eta, y=mu_exp, name="exp (Poisson/log)", mode="lines"))
fig.update_layout(
    title="One linear score η = Xβ, many different means μ = g⁻¹(η)",
    xaxis_title="η",
    yaxis_title="μ",
    width=900,
    height=450,
)
fig.show()

2) The GLM recipe#

A GLM is defined by:

  1. A distribution for \(y \mid x\) from an exponential family / EDM

  2. A link function \(g\) connecting the mean to the linear predictor

\[ \eta = X\beta,\qquad \mu = \mathbb{E}[y\mid X] = g^{-1}(\eta) \]

Exponential family (one common form)#

A lot of useful distributions can be written as:

\[ \log p(y\mid \theta, \phi) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \]
  • \(\theta\) is the natural parameter

  • \(\phi\) is a dispersion parameter

  • \(b(\theta)\) controls the mean/variance

In GLMs, the canonical link is the choice that sets \(\theta = \eta\), which makes the likelihood equations particularly simple; other links are possible, but the canonical link is a common default.
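As a worked example (not part of the table below, but it makes the template concrete), the Bernoulli distribution with success probability \(p\) fits this form:

\[ \log p(y\mid p) = y\log p + (1-y)\log(1-p) = y\log\frac{p}{1-p} + \log(1-p) \]

so the natural parameter is \(\theta = \log\frac{p}{1-p}\) (the logit), \(b(\theta) = \log(1+e^{\theta})\), and \(a(\phi)=1\). Note that \(b'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} = p = \mu\): differentiating \(b\) recovers the mean, which is exactly why the logit is the canonical link for the Bernoulli.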

3) Deviance: “distribution-aware squared error”#

For Gaussian noise, minimizing squared error is equivalent to maximizing a Gaussian likelihood.

For general GLMs, we maximize likelihood for the chosen distribution. In many GLM-style estimators, this is written using the unit deviance \(d(y, \mu)\).

A typical regularized objective is:

\[ \min_{\beta}\ \frac{1}{2}\,\frac{1}{n}\sum_{i=1}^n d\bigl(y_i, \mu_i\bigr)\; +\; \frac{\alpha}{2}\,\lVert \beta \rVert_2^2 \]

If sample weights \(w_i\) are provided, the average becomes a weighted average.

Unit deviance cheat sheet (common EDMs)#

Below are standard unit deviances (with conventions like \(0\log 0 = 0\)).

| Distribution | Target domain | Unit deviance \(d(y,\mu)\) |
|---|---|---|
| Normal | \(\mathbb{R}\) | \((y-\mu)^2\) |
| Bernoulli | \(\{0,1\}\) | \(2\left[y\log\frac{y}{\mu} + (1-y)\log\frac{1-y}{1-\mu}\right]\) |
| Categorical (multinomial, 1 trial) | one-hot | \(2\sum_k y_k\log\frac{y_k}{\mu_k}\) |
| Poisson | \(\{0,1,2,\dots\}\) | \(2\left[y\log\frac{y}{\mu} - (y-\mu)\right]\) |
| Gamma | \(\mathbb{R}_{+}\) | \(2\left[\frac{y-\mu}{\mu} - \log\frac{y}{\mu}\right]\) |
| Inverse Gaussian | \(\mathbb{R}_{+}\) | \(\frac{(y-\mu)^2}{\mu^2 y}\) |

Interpretation:

  • Normal deviance is just squared distance.

  • Bernoulli deviance matches log-loss (up to a constant).

  • Poisson deviance compares counts in a way that respects positivity and relative error.
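The first two bullets can be checked numerically. The sketch below (helper names are ours, not from any library) verifies that for \(y\in\{0,1\}\) the Bernoulli unit deviance is exactly twice the pointwise log-loss, and that the Poisson unit deviance touches zero only at \(\mu = y\):

```python
import numpy as np

def bernoulli_deviance(y, mu):
    # Convention 0*log(0) = 0, handled by masking the vanishing terms.
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    t1 = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    t2 = np.where(y < 1, (1 - y) * np.log(np.where(y < 1, (1 - y) / (1 - mu), 1.0)), 0.0)
    return 2.0 * (t1 + t2)

def log_loss_pointwise(y, mu):
    return -(y * np.log(mu) + (1 - y) * np.log(1 - mu))

y = np.array([0.0, 1.0, 1.0, 0.0])
mu = np.array([0.2, 0.7, 0.9, 0.4])

# For 0/1 targets the y*log(y) "entropy" terms vanish, so d(y, mu) = 2 * log-loss.
assert np.allclose(bernoulli_deviance(y, mu), 2.0 * log_loss_pointwise(y, mu))

# Poisson unit deviance for y = 5 is minimized (at 0) exactly at mu = 5.
mu_grid = np.linspace(0.1, 15, 2000)
d = 2.0 * (5.0 * np.log(5.0 / mu_grid) - (5.0 - mu_grid))
print(mu_grid[np.argmin(d)])
```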

# Visual intuition: unit deviance as a function of μ for fixed y

mus = np.linspace(1e-4, 0.9999, 500)

# Bernoulli deviance for y=1 and y=0
bern_y1 = 2.0 * np.log(1.0 / mus)
bern_y0 = 2.0 * np.log(1.0 / (1.0 - mus))

# Poisson deviance for y=5 (μ must be > 0)
mu_pos = np.linspace(1e-3, 15, 500)
y_p = 5.0
pois_dev = 2.0 * (y_p * np.log(y_p / mu_pos) - (y_p - mu_pos))

fig = make_subplots(rows=1, cols=2, subplot_titles=("Bernoulli unit deviance", "Poisson unit deviance (y=5)"))

fig.add_trace(go.Scatter(x=mus, y=bern_y1, mode="lines", name="y=1"), row=1, col=1)
fig.add_trace(go.Scatter(x=mus, y=bern_y0, mode="lines", name="y=0"), row=1, col=1)
fig.update_xaxes(title_text="μ", row=1, col=1)
fig.update_yaxes(title_text="d(y,μ)", row=1, col=1)

fig.add_trace(go.Scatter(x=mu_pos, y=pois_dev, mode="lines", name="y=5"), row=1, col=2)
fig.update_xaxes(title_text="μ", row=1, col=2)
fig.update_yaxes(title_text="d(y,μ)", row=1, col=2)

fig.update_layout(width=1000, height=420, title="Deviance = mismatch measure tailored to the distribution")
fig.show()
# PDFs/PMFs: a quick visual gallery of common GLM distributions

fig = make_subplots(
    rows=2,
    cols=3,
    subplot_titles=(
        "Normal (PDF)",
        "Bernoulli (PMF)",
        "Categorical (PMF)",
        "Poisson (PMF)",
        "Gamma (PDF)",
        "Inverse Gaussian (PDF)",
    ),
)

# Normal
x = np.linspace(-4, 4, 400)
fig.add_trace(go.Scatter(x=x, y=norm.pdf(x, loc=0, scale=1), mode="lines"), row=1, col=1)

# Bernoulli
p = 0.3
fig.add_trace(go.Bar(x=[0, 1], y=[1 - p, p]), row=1, col=2)

# Categorical
probs = np.array([0.15, 0.25, 0.60])
fig.add_trace(go.Bar(x=["A", "B", "C"], y=probs), row=1, col=3)

# Poisson
lam = 4.0
k = np.arange(0, 16)
fig.add_trace(go.Bar(x=k, y=poisson.pmf(k, mu=lam)), row=2, col=1)

# Gamma
xg = np.linspace(1e-4, 10, 400)
shape = 2.0
scale = 1.0
fig.add_trace(go.Scatter(x=xg, y=gamma_dist.pdf(xg, a=shape, scale=scale), mode="lines"), row=2, col=2)

# Inverse Gaussian (SciPy parameterization)
xi = np.linspace(1e-4, 10, 400)
mu_param = 1.5
scale_param = 1.0
fig.add_trace(go.Scatter(x=xi, y=invgauss.pdf(xi, mu=mu_param, scale=scale_param), mode="lines"), row=2, col=3)

fig.update_layout(width=1100, height=650, title="Distribution shapes (PDF/PMF)")
fig.show()

4) Optimization intuition: IRLS#

For many GLMs (especially with canonical links), Newton’s method / Fisher scoring yields an update that looks like:

solve a weighted least squares problem, update \(\beta\), repeat

This is called IRLS (Iteratively Reweighted Least Squares).

A useful mental model:

  • a GLM is “like linear regression”, but the relationship between \(\eta\) and \(\mu\) is nonlinear

  • IRLS repeatedly builds a local quadratic approximation and solves a linear-looking problem
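Concretely, for a link \(g\) with mean \(\mu = g^{-1}(\eta)\) and variance function \(V(\mu)\), one common form of the Fisher-scoring step is:

\[ w_i = \frac{\left(d\mu_i/d\eta_i\right)^2}{V(\mu_i)},\qquad z_i = \eta_i + (y_i-\mu_i)\,\frac{d\eta_i}{d\mu_i},\qquad \beta \leftarrow \bigl(X^\top W X\bigr)^{-1} X^\top W z \]

where \(W=\mathrm{diag}(w)\) and \(z\) is the “working response”. For canonical links, \(d\mu/d\eta = V(\mu)\), so the weights simplify to \(w_i = V(\mu_i)\): \(\mu_i(1-\mu_i)\) for Bernoulli + logit and \(\mu_i\) for Poisson + log, which is exactly what the implementations below use.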

We’ll implement IRLS for:

  • Bernoulli + logit (logistic regression)

  • Poisson + log (Poisson regression)

def add_intercept(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def irls_logistic_regression(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float = 0.0,
    max_iter: int = 50,
    tol: float = 1e-8,
):
    """Logistic regression via IRLS with optional L2 penalty (ridge).

    alpha: L2 strength (penalizes coefficients except intercept).
    """

    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    n, d = X.shape
    beta = np.zeros(d, dtype=float)

    losses = []
    reg = alpha * np.eye(d)
    reg[0, 0] = 0.0

    eps = 1e-12

    for _ in range(max_iter):
        eta = X @ beta
        mu = expit(eta)

        w = mu * (1.0 - mu)
        w = np.clip(w, 1e-9, None)

        z = eta + (y - mu) / w

        sqrt_w = np.sqrt(w)
        Xw = X * sqrt_w[:, None]
        zw = z * sqrt_w

        A = Xw.T @ Xw + reg
        b = Xw.T @ zw

        beta_new = np.linalg.solve(A, b)

        nll = -np.mean(y * np.log(mu + eps) + (1.0 - y) * np.log(1.0 - mu + eps))
        penalty = 0.5 * alpha * np.sum(beta[1:] ** 2)
        losses.append(float(nll + penalty))

        if np.linalg.norm(beta_new - beta) <= tol * (np.linalg.norm(beta) + tol):
            beta = beta_new
            break

        beta = beta_new

    return beta, np.array(losses)
# Logistic regression demo (2D)
X, y = make_blobs(
    n_samples=700,
    centers=[(-2.0, -1.0), (2.0, 1.0)],
    cluster_std=[1.6, 1.4],
    random_state=7,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)

fig = px.scatter(
    x=X_tr[:, 0],
    y=X_tr[:, 1],
    color=y_tr.astype(str),
    title="Binary classification dataset",
    labels={"x": "x1", "y": "x2", "color": "class"},
)
fig.update_traces(marker=dict(size=6, opacity=0.7))
fig.update_layout(width=720, height=470)
fig.show()
X_tr_i = add_intercept(X_tr)
X_te_i = add_intercept(X_te)

beta_hat, losses = irls_logistic_regression(X_tr_i, y_tr, alpha=0.5, max_iter=80)

# sklearn reference (LogisticRegression's C is the inverse L2 strength, so C = 1/alpha)
sk = LogisticRegression(C=1/0.5, penalty="l2", solver="lbfgs", fit_intercept=True, max_iter=2000)
sk.fit(X_tr, y_tr)

pred_scratch = (expit(X_te_i @ beta_hat) >= 0.5).astype(int)
pred_sklearn = sk.predict(X_te)

print("Scratch IRLS logistic accuracy:", accuracy_score(y_te, pred_scratch))
print("sklearn LogisticRegression accuracy:", accuracy_score(y_te, pred_sklearn))
print("Scratch beta:", beta_hat)
print("sklearn  beta:", np.r_[sk.intercept_, sk.coef_.ravel()])
Scratch IRLS logistic accuracy: 0.9666666666666667
sklearn LogisticRegression accuracy: 0.9666666666666667
Scratch beta: [-0.0781  1.8208  1.0538]
sklearn  beta: [-0.0781  1.8208  1.0538]
# Convergence plot
fig = go.Figure()
fig.add_trace(go.Scatter(y=losses, x=np.arange(1, len(losses) + 1), mode="lines+markers"))
fig.update_layout(title="IRLS convergence (avg NLL + ridge)", xaxis_title="iteration", yaxis_title="loss", width=700, height=420)
fig.show()
def plot_probability_surface_2d(model_proba_fn, X, y, title: str, grid_steps: int = 220):
    x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

    xs = np.linspace(x_min, x_max, grid_steps)
    ys = np.linspace(y_min, y_max, grid_steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]

    proba = model_proba_fn(grid).reshape(xx.shape)

    fig = go.Figure()
    fig.add_trace(go.Contour(
        x=xs,
        y=ys,
        z=proba,
        colorscale="RdBu",
        opacity=0.75,
        contours=dict(showlines=False),
        colorbar=dict(title="P(class=1)"),
    ))
    fig.add_trace(go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        marker=dict(color=y, colorscale="Viridis", size=6, line=dict(width=0.5, color="white")),
        name="data",
    ))
    fig.update_layout(title=title, width=760, height=520)
    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    return fig


fig1 = plot_probability_surface_2d(lambda Z: expit(add_intercept(Z) @ beta_hat), X_te, y_te, "Scratch logistic: P(class=1)")
fig1.show()

fig2 = plot_probability_surface_2d(lambda Z: sk.predict_proba(Z)[:, 1], X_te, y_te, "sklearn logistic: P(class=1)")
fig2.show()

5) Poisson regression (counts) as a GLM#

When \(y\) is a count, a natural model is Poisson:

\[ (y \mid x) \sim \text{Poisson}(\mu) \]

A common link is the log link:

\[ \log \mu = \eta = x^\top \beta \quad\Rightarrow\quad \mu = \exp(x^\top\beta) \]

This guarantees \(\mu>0\).

An IRLS update exists here too; it has the same structure as the logistic case, with weights \(w_i = \mu_i\) instead of \(\mu_i(1-\mu_i)\).

def irls_poisson_regression(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float = 0.0,
    max_iter: int = 80,
    tol: float = 1e-8,
):
    """Poisson regression via IRLS with optional L2 penalty (ridge).

    Uses log link: μ = exp(η).
    alpha penalizes coefficients except intercept.
    """

    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    n, d = X.shape
    beta = np.zeros(d, dtype=float)

    losses = []
    reg = alpha * np.eye(d)
    reg[0, 0] = 0.0

    eps = 1e-12

    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)

        w = np.clip(mu, 1e-9, None)   # canonical-link weights: Var(y) = μ for Poisson
        z = eta + (y - mu) / w        # working response (clipped w avoids division by ~0)

        sqrt_w = np.sqrt(w)
        Xw = X * sqrt_w[:, None]
        zw = z * sqrt_w

        A = Xw.T @ Xw + reg
        b = Xw.T @ zw

        beta_new = np.linalg.solve(A, b)

        nll = float(np.mean(mu - y * np.log(mu + eps)))
        penalty = 0.5 * alpha * float(np.sum(beta[1:] ** 2))
        losses.append(nll + penalty)

        if np.linalg.norm(beta_new - beta) <= tol * (np.linalg.norm(beta) + tol):
            beta = beta_new
            break

        beta = beta_new

    return beta, np.array(losses)
# Synthetic Poisson regression
n = 1500
Xp = rng.normal(size=(n, 2))
Xp_i = add_intercept(Xp)

beta_true = np.array([0.3, 0.6, -0.4])
rate = np.exp(Xp_i @ beta_true)

yp = rng.poisson(lam=rate)

Xp_tr, Xp_te, yp_tr, yp_te = train_test_split(Xp, yp, test_size=0.3, random_state=7)
Xp_tr_i = add_intercept(Xp_tr)
Xp_te_i = add_intercept(Xp_te)

beta_p, losses_p = irls_poisson_regression(Xp_tr_i, yp_tr, alpha=1.0)

sk_p = PoissonRegressor(alpha=1.0, fit_intercept=True, max_iter=5000)
sk_p.fit(Xp_tr, yp_tr)

print("True beta:", beta_true)
print("Scratch beta:", beta_p)
print("sklearn beta:", np.r_[sk_p.intercept_, sk_p.coef_])
True beta: [ 0.3  0.6 -0.4]
Scratch beta: [ 0.3242  0.6087 -0.3772]
sklearn beta: [ 0.4849  0.3797 -0.2432]
The coefficients differ because the two objectives scale the penalty differently: the scratch IRLS adds \(\frac{\alpha}{2}\lVert\beta\rVert^2\) to the summed negative log-likelihood, while PoissonRegressor penalizes the averaged deviance, so the same alpha shrinks roughly \(n\) times harder there. Rescaling alpha brings the two solutions into agreement.
# Predict and compare
mu_hat_scratch = np.exp(Xp_te_i @ beta_p)
mu_hat_sklearn = sk_p.predict(Xp_te)

rmse_scratch = float(np.sqrt(mean_squared_error(yp_te, mu_hat_scratch)))
rmse_sklearn = float(np.sqrt(mean_squared_error(yp_te, mu_hat_sklearn)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_hat_scratch, y=yp_te, mode="markers", name=f"scratch (RMSE={rmse_scratch:.2f})", opacity=0.6))
fig.add_trace(go.Scatter(x=mu_hat_sklearn, y=yp_te, mode="markers", name=f"sklearn (RMSE={rmse_sklearn:.2f})", opacity=0.6))
fig.update_layout(
    title="Poisson regression: predicted mean μ vs observed count y",
    xaxis_title="predicted μ",
    yaxis_title="observed y",
    width=900,
    height=500,
)
fig.show()

6) Gamma / Inverse Gaussian (positive, skewed targets)#

When \(y>0\) and is skewed, Gaussian noise is often a poor match.

Two classic EDM choices:

  • Gamma: variance grows like \(\mu^2\) (common for positive skew)

  • Inverse Gaussian: even heavier variance growth
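The Gamma variance claim is easy to verify by simulation. A quick sketch (with our own choice of shape parameter): for a Gamma with fixed shape \(k\) and mean \(\mu\), \(\mathrm{Var}(y) = \mu^2/k\), so the ratio \(\mathrm{Var}(y)/\mu^2\) should be constant across means:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2.0  # fixed shape; scale = mu / k gives mean mu and variance mu^2 / k
ratios = []
for mu in [1.0, 3.0, 10.0]:
    y = rng.gamma(shape=k, scale=mu / k, size=200_000)
    ratios.append(y.var() / mu**2)  # each ratio should be close to 1/k = 0.5

print(np.round(ratios, 3))
```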

In scikit-learn, these show up via:

  • GammaRegressor (Gamma deviance)

  • TweedieRegressor(power=3) (Inverse Gaussian deviance)

Tweedie family connection#

A convenient unification is the Tweedie variance function:

\[\mathrm{Var}(y\mid x) \propto \mu^{p}\]
  • \(p=0\) → Normal

  • \(p=1\) → Poisson

  • \(p=2\) → Gamma

  • \(p=3\) → Inverse Gaussian
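This unification is visible in scikit-learn itself: PoissonRegressor is the Tweedie model with power=1 and log link, so the two estimators should land on (essentially) the same coefficients. A small check on synthetic data (dataset choices here are ours):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 2))
y = rng.poisson(lam=np.exp(0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1]))

# Same objective, two entry points: Poisson deviance == Tweedie deviance at power=1.
m1 = PoissonRegressor(alpha=0.1, max_iter=5000).fit(X, y)
m2 = TweedieRegressor(power=1.0, link="log", alpha=0.1, max_iter=5000).fit(X, y)

print(np.r_[m1.intercept_, m1.coef_])
print(np.r_[m2.intercept_, m2.coef_])
```

The same trick works at the other powers: `power=2` recovers GammaRegressor, and `power=0` is ridge regression on the identity of the chosen link.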

# A synthetic positive-skew regression dataset

n = 1800
Xg = rng.normal(size=(n, 2))
Xg_i = add_intercept(Xg)

beta_true = np.array([0.2, 0.5, -0.3])
mu_true = np.exp(Xg_i @ beta_true)

# Gamma sample with mean mu_true: choose shape k, scale=mu/k
k_shape = 2.0
scale = mu_true / k_shape

y_gamma = rng.gamma(shape=k_shape, scale=scale)

Xg_tr, Xg_te, yg_tr, yg_te = train_test_split(Xg, y_gamma, test_size=0.3, random_state=7)

m_gamma = GammaRegressor(alpha=1.0, fit_intercept=True, max_iter=5000)
m_gamma.fit(Xg_tr, yg_tr)

m_ig = TweedieRegressor(power=3.0, alpha=1.0, link='log', fit_intercept=True, max_iter=5000)
m_ig.fit(Xg_tr, yg_tr)

pred_gamma = m_gamma.predict(Xg_te)
pred_ig = m_ig.predict(Xg_te)

fig = go.Figure()
fig.add_trace(go.Scatter(x=pred_gamma, y=yg_te, mode="markers", name="GammaRegressor", opacity=0.6))
fig.add_trace(go.Scatter(x=pred_ig, y=yg_te, mode="markers", name="Tweedie power=3 (InvGauss)", opacity=0.6))
fig.update_layout(title="Positive skew regression: predictions vs observed", xaxis_title="predicted mean", yaxis_title="observed y", width=900, height=520)
fig.show()

print("GammaRegressor coef:", np.r_[m_gamma.intercept_, m_gamma.coef_])
print("InvGauss(Tweedie p=3) coef:", np.r_[m_ig.intercept_, m_ig.coef_])
GammaRegressor coef: [ 0.2419  0.2458 -0.1576]
InvGauss(Tweedie p=3) coef: [ 0.1594  0.2339 -0.1522]

7) Multiclass (categorical) GLM intuition#

For \(K\) classes, you can model:

\[ P(y=k\mid x) = \text{softmax}_k(\eta)\quad\text{where}\quad \eta_k = x^\top \beta_k \]

This is a categorical (multinomial) GLM. The deviance corresponds to cross-entropy.

In practice, sklearn.linear_model.LogisticRegression with a multinomial-capable solver implements this.
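A minimal numpy sketch of the mapping (our own helper, not sklearn's internals): \(K\) linear scores go through a softmax to become probabilities, and for a one-hot target the categorical unit deviance from the table above is exactly twice the cross-entropy:

```python
import numpy as np

def softmax(eta):
    # Subtract the row max for numerical stability; result is unchanged.
    e = np.exp(eta - eta.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

eta = np.array([[2.0, 0.5, -1.0]])   # K = 3 linear scores for one sample
mu = softmax(eta)                     # probabilities, sum to 1

y = np.array([[1.0, 0.0, 0.0]])      # one-hot target: true class is 0
cross_entropy = -np.sum(y * np.log(mu))
deviance = 2.0 * np.sum(y * np.log(np.where(y > 0, y / mu, 1.0)))

print(cross_entropy, deviance)        # deviance == 2 * cross-entropy here
```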

# Quick multiclass demo (sklearn)
X3, y3 = make_blobs(
    n_samples=900,
    centers=[(-2, 0), (2, 0), (0, 2)],
    cluster_std=[1.1, 1.1, 1.1],
    random_state=7,
)
X3_tr, X3_te, y3_tr, y3_te = train_test_split(X3, y3, test_size=0.3, random_state=7, stratify=y3)

m_multi = LogisticRegression(solver='lbfgs', max_iter=5000)
m_multi.fit(X3_tr, y3_tr)

print("Multiclass accuracy:", accuracy_score(y3_te, m_multi.predict(X3_te)))

fig = px.scatter(x=X3_te[:, 0], y=X3_te[:, 1], color=y3_te.astype(str), title="3-class dataset", labels={"x":"x1","y":"x2","color":"class"})
fig.update_traces(marker=dict(size=6, opacity=0.7))
fig.update_layout(width=720, height=470)
fig.show()
Multiclass accuracy: 0.8925925925925926

8) Summary#

  • GLMs keep the linear predictor \(\eta=X\beta\), but change the mapping to the mean and the loss.

  • The distribution choice gives you the right domain and right notion of error (deviance).

  • IRLS gives an intuitive optimization story: repeatedly solve a weighted least-squares problem.

Exercises#

  1. Implement IRLS for Gamma with log link (harder but very educational).

  2. Create a dataset with over-dispersed counts and compare Poisson vs a Tweedie model.

  3. Change alpha and observe coefficient shrinkage for PoissonRegressor.

References#

  • scikit-learn: GLM and Tweedie regressors

  • “Generalized Linear Models” (McCullagh & Nelder)